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Abstract 

This paper studies the problem of testing whether a function is monotone 
from a nonparametric Bayesian perspective. Two new families of tests are 
constructed. The first uses constrained smoothing splines, together with a hi- 
erarchical stochastic-process prior that explicitly controls the prior probability 
of monotonicity. The second uses regression splines, together with two propos- 
als for the prior over the regression coefficients. The finite-sample performance 
of the tests is shown via simulation to improve upon existing frequentist and 
Bayesian methods. The asymptotic properties of the Bayes factor for comparing 
monotone versus non-monotone regression functions in a Gaussian model are 
also studied. Our results significantly extend those currently available, which 
chiefiy focus on determining the dimension of a parametric linear model. 

Keywords: Bayesian asymptotics; Model selection; Monotonicity; Regression 
splines; Smoothing splines 



1 Introduction 



X 



Many authors have used Bayesian methods for estimating functions that are known 
to have certain shape restrictions, including positivity, monotonicity, and convexity. 



Examples of work in this area include Holmes and Heard (2003), Neelon and Dunson 



(|2004D, Dunson] (2005D , [Shively et al.| ( |2009D , [Shively et al.| ( |201lD , and [Hannah and 
Dunson ( 2012a[[b ). But an important question unaddressed by this work is whether it 
is appropriate to impose a specific shape constraint. If it is, the resulting function esti- 
mates are often considerably better than those obtained using unconstrained methods 



(Cai and Dunson, 2007). Conversely, the inappropriate imposition of constraints will 



result in poor function estimates. 
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This paper considers the use of nonparametric Bayesian methods to test for mono- 
tonicity. The approach has connections with two major areas of the Bayesian htera- 



ture. First, we extend work on regression and smoothing sphnes (Smith and Kohn 



1996 Dimatteo et al. , 2001 Panagiotehs and Smith 2008 Shively et al. 2009), and 



dictionary expansions more generally (Clyde and Wolpert , 2007), adapting these tools 



in such a way that they become appropriate for hypothesis testing. Second, we extend 



work on default Bayesian model selection (Berger and Pericchi 2001 George and Fos- 



ter 2000; Clyde and George 2004; Giron et al. , 2010) to the realm of nonparametric 



priors for functional data. 

Our main theoretical contribution is to characterize the asymptotic properties of 
the Bayes factor for comparing monotone versus non-monotone mean regression func- 
tions in a normal model. To date, such asymptotics have been exclusively orientated 
to comparing normal linear models of differing dimension. See, for example, [Giron 



et al. (2010). We generalize this work to the problem of comparing non- linear mean 



functions. We do so using a technique introduced in Walker and Hjort (2002) that 



combines the properties of maximum-likelihood estimators and Bayesian marginal 
likelihoods that involve the square roots of likelihood functions. 

Our main methodological contribution is to construct two new families of priors 
that are appropriate in a Bayesian test for monotonicity. The first approach uses a 
monotone version of smoothing splines, together with a hierarchical stochastic-process 
prior that allows explicit control over the time at which the underlying function's 
derivative first becomes negative. The second approach constructs a family of tests 



using a regression spline model (Smith and Kohn, 1996) with a mixture of constrained 



multivariate distributions as the prior for the regression coefficients. We study two 
possible choices for this prior: constrained multivariate Gaussian distributions, and 
a constrained version of the multivariate non-local method-of-moments distribution 



(Johnson and Rossell, 2010). The mixing parameters in both priors can be adjusted 



to allow the user to set the prior probability of a monotone function. 

These approaches work for a very general class of sampling models, including those 
for continuous, binary, and count data. They are most easily understood, however, in 
the case where data Y = {yi, ...,?/„}, observed at ordered points x G {xi, . . . , x„}, 
are assumed to arise from a normal sampling model, {t/i \ fi) ~ N(/j,o"^). Here 
fi = f{xi) denotes the value at Xi of an unobserved, real- valued stochastic process 
f{x), with X G [0, 1] and / in some function space J-'. To test whether f{x) is non- 
decreasing, one must compare the evidence for Hq : f E J^q against Hi : f E J^i, 
where 



To = {/ G J" : f{s2) > f{si) for all pairs S2 > sj , 

J^i = {f E J^ : /(S2) < /(si) for at least one pair S2 > Si} . 

Under the Bayesian approach to this problem, two important goals are in conflict. 
One goal is flexibility: we wish to make few assumptions about /(x), so that the 
testing procedure may accommodate a wide class of functions. Applying this principle 



naively would lead us to choose an encompassing function space T with large support, 
and vague prior measures Ho and Hi over Tq and T\^ respectively. On the other hand, 
the Bayes factor for comparing i^o versus li\ is 

^,{nf=i0(?/d/(^.),^^)} ^no(/) 

BF(-ffo : Hi) = 7 — ^ , 



which is heavily influenced by the dispersion of IIo and IIi ( Scott , 2009 ) . This strongly 
contra-indicates the use of noninformative priors for model selection. In particular, 
improper priors may not be used, as this leaves the Bayes factor defined only up to 
an arbitrary multiplicative constant. 

To balance these conflicting goals, we let J-" = C^, the space of all continuously 
different iable functions. Membership in C^ is easily enforced by supposing that f{x) = 
Jq g{s) ds, where g{s) G C, the space of real- valued stochastic processes with almost- 
surely continuous sample paths. The test may now be phrased in terms of Hq : g{s) G 
C^ versus Hi : g{s) G C~ , where 

C+ = {g eC: g{s) > for all s G [0, 1]} , 

c- = c\c+ . 

Attention is then focused on the choice of prior for g{s). The closest proposal 



to ours is that of Dunson (2005), who used a Bonferroni-like method for controlling 
the overall probability of monotonicity via a semiparametric model akin to a variable- 
selection prior. There is also a large body of classical work on specification tests, many 
of which are appropriate for assessing monotocity or more general aspects of functional 



form. Two examples of work in this area are the papers by Zheng (1996), who 



proposed a test of functional form based on the theory of U-statistics, and Bowman 



et al. ( 1998 ), who used a two-stage boostrap test for monotonicity based on the idea of 
a critical bandwidth parameter. Our simulation study finds that the new approaches 
proposed here outperform these methods, in that they have better power to detect 
departures from monotonicity at a fixed false-positive rate. 



2 Constrained smoothing splines 

A straightforward nonparametric Bayesian strategy to estimate a continuous function 
is to model the derivative g{x) as a scaled Wiener process prior, implying that the 
increments g{s2)—g{si) are normally distributed with mean zero and variance r^(s2 — 
Si). But this is inappropriate for model selection, since it places unit prior probability 
on the hypothesis of non-monotonicity. To see this, recall that for every positive c, 
the sample path of a Wiener process almost surely takes both positive and negative 
values on (0, c). 

We therefore adapt this smoothing-spline approach to the testing problem. Let 



^ = infs>o{'5 '■ g{s) = 0} be a latent variable denoting the first downward crossing 
point of g{x). This maps directly onto the hypotheses of interest: ^ > 1 if and only 
if Hq is true, since the derivative g{s) will be strictly positive on the unit interval. 

The introduction of ^ turns a hypothesis test for an infinite-dimensional object / 
into a one-dimensional estimation problem, via a hierarchical model for f{x): 



X 

2\ 



iy^\f) ~ N(/(x,),a^), f{x)= / g{s)ds 







(^|e,r) ~ n{g\^,T), e~p(0 



with obvious modification in the non-Gaussian case. Here Il{g \ C,, t) denotes the 
conditional probability distribution of a Wiener process with scale parameter r, given 
infs>o{s : g{s) = 0} = ^. This differs from the one-dimensional Brownian bridge, in 
that g{x) is restricted to be positive over the interval (0,^). 

We refer to the overall approach as a constrained smoothing spline. To test Hq 
versus Hi, we compute the posterior distribution for ^, and identify the posterior 
probability of Hq as pr(,^ > 1 | F). One default choice for p{C,) is a standard log- 
normal distribution, but any proper prior with a median of 1 could be used. 

As the data are observed on a discrete grid, it is necessary to characterize the 
increments of the above stochastic-process prior. The following proposition describes 
the distribution of these increments, here referred to as the fractional normal distri- 
bution. 

Proposition 1. Let Il{g \ ^,t) denote the conditional probability measure of a scaled 
Wiener process g{s) with scale parameter r, given the condition that the first down- 
ward crossing of g{x) occurs at x = ^, where conditioning is meant in the sense of 
Doob's h transform. Let < Si < ,^, and define gi = g{si) and go = g{0). The 
conditional distribution p{gi \ go,C,,T) arising from Il{g \ ^,r) is a fractional normal 
distribution, denoted FN{gi \ go,C,,T), with density 

P{gi I ^0, ^, t) = ^^^ r^ gi smh{mgi/h) ^xp < -^ ^ , (2) 



where h = t{^u{1 - m)}^/^ and m = (5fo/r){(l - u)/{^u)Y^'^. 



Proof. The proof generalizes the argument of Chigansky and Klebaner (2008) to 
a scaled Wiener process, and follows their approach closely. Let \^ be a three- 
dimensional Brownian bridge with scale r, with V{0) = v and V{C,) = 0, for some 
point V having Euclidean norm ||f || = go. That is, 

V{s)=v + W{s)-^{W{0 + v} for s<e, (3) 

where W{s) is a Wiener process in three dimensions with scale parameter r. Observe 
that the random variable of interest is equal in distribution to the radial part of V, 



observed at Si: 



igl\^,9o) = \\Vis^] 



Moreover, since ||\^|| is independent of the the starting point v, as long as \\v\\ = go, 
one may choose v = ((70, 0,0) to simphfy the calculations. From ([s]), this leads to 

r 2 ■^ ^/^ 

\\V{si)\\ = I (^z^Ty/u{l-u)^ + ^1(1 - «)) + ^^(1 - uyzj + ^u{l - uyzl 

(4) 
where u = [C, — Si)/{C, — Sq), and where {zi, Z2, z^) are independent, standard-normal 
draws. Equivalently, 



\\Vis^)\\^h{iz, + mY + z| + z|} 



with m and h defined as above. The density z^ + ^3 is that of an exponential distribu- 
tion with rate 1/2. Meanwhile, the density of the term t] = (zi+m)^ may be evaluated 
directly using the fact that p{r]) = ^-P(^ < u) and differentiating under the integral 

sign. After a simple change of variables, the density oi 9 = [{zi + m) + 2;| + z|} 
may be computed via convolution as 

f,{9) = ^- esmh{m9) exp(-^V2) • 

mx/Tr 



The result follows from the fact that the density of gi = hO is p{gi) = fc{6/h)/h. D 

Proposition [1] characterizes the distribution of the increments of g{x) for a given 
^, and leads to an efficient sequential Monte Carlo algorithm for fitting the model. 
The details of the algorithm are in the appendix. 

3 Constrained regression splines 

This section develops a second family of tests for monotonicity using a regression 
spline model with different prior distributions for the regression coefficients. A finitely 
parametrized approximation to the function f{x) is given by the spline expansion 

fm{x) =a + l3iX + I32X^ + (5^{x - Xi)\ H h (3m+2{x - Xm)\ , (5) 

where the Xj are the ordered knot locations, and z+ indicates the positive part of z. 
Throughout this section we will assume the model r/i = fm{xi) + Cj for some choice of 
m. For notational convenience let xq = and Xm+i = 1- Clearly the first derivative 
fm{x) is linear between each pair of knots. Therefore, ii f^{xj) > and f^{xj+i) > 0, 
then f^{x) > on [xj, x^+i]. If this condition holds for all j = 0, . . . , m, then fm{x) 
is monotone on [0, 1]. 



To develop a test for monotonicity based on regression splines, we follow Smith 



and Kohn (1996) and Shively et al. (2009), using Bayesian variable selection to choose 
the location of the knots from a set of m pre-specified values. Using all m knots, (pi) 
may be re-written in matrix notation as y = al + X(5 + e, where y is the ri x 1 vector 
of observations, 1 is a vector of ones, X is an n x (m + 2) design matrix, and /3 is 
the vector of spline coefficients. Let i be a vector of indicator variables whose jth 
element takes the value if Pj = 0, and 1 otherwise. Let (3^ consist of the elements of 
(3 corresponding to those elements of l that are equal to one, and let p = |i| denote 



the number of nonzero entries in l. Shively et al. (2009) derive the constraints on /3t 



that ensure the monotonicity of fm{x) for a given l. Specifically, fm{x) is monotone 
whenever L^P^ = 7^ > 0, where L^ is a known lower-triangular matrix that depends 
on i and the x^-'s. Checking whether any element of 7^ is negative, and controlling 
the prior probability of such a constraint violation across all possible values of i, are 
the basis of our test for monotonity. The matrix L^ therefore plays an important role 
in the prior distribution for /J^. 

Given 6, the (3^ space is divided into 2^ disjoint regions denoted Rl ,. . . ,Rl , 
with each region defined by a different combination of signs of f^i^) ^^ each of the 
included knots. Without loss of generality we may let R[ denote the region where the 
derivative is non-negative at each of the included knots, in which case /^(x) > for 
all X G [0, 1]. For a specific l and prior p{(3^), one may compute the prior probability 
pr(/3t G -R!: I l), which is identical to pr{/,^(a;) > for all x G [0, 1] | l}. 

The key feature of our approach is that we are able to specify priors on l and P^ 
that allow explicit control over the prior probability of a monotone function. Given 
these priors, we compute pr{//^(x) > for all x G [0, 1] | y}, the posterior probability 
of a monotone function with l and (3^ marginalized out. The function is declared to 
be monotone if this probability exceeds a specified threshold. 

We now describe the overall approach for constructing p{l) and p{(3u)- The Lj are 
assumed to be independent with pr(6j = 0) = pj. Given l and the error variance a"^, 
the prior for (3^, is a discrete mixture of 2^ multivariate distributions, 

d=l 

where 11^ is constrained to have support on Ri . Within each component of the 
mixture, there is a fixed combination of signs of fln{x) at each of the included knots. 
As Rl corresponds to the region where flnix) > for all x, qi controls the prior 
probability of monotonicity. 

Two specific choices for 11^ are considered: one based on the multivariate Gaussian 
distribution, and the other based on the multivariate method-of-moments distribution 



described by Johnson and Rossell (2010). Both priors involve the constraint matrix 
L^ in order to ensure that the necessary integrals are analytically feasible. This is 
analogous to the use of gf-priors in ordinary linear regression. In the appendix these 
priors are constructed in detail, and a Markov-chain sampling algorithm is presented 



for sampling from the joint posterior distribution over all model parameters. 



4 Asymptotic properties of Bayes factors 

4.1 Independent and identically distributed models 

This section develops the asymptotic properties of the Bayes factor in a test for mono- 
tonicity. This analysis requires a new approach that significantly extends previous 
methods used to study Bayes factors in the context of parametric linear models. 

To introduce the new approach, we first consider the case of independent and 
identically distributed observations. Suppose two Bayesian models, Mi and M2, are 
to be studied and compared via the use of a Bayes factor: 

In the following, dxif^g) = J f^og{f/g) is the KuUback-Leibler divergence between 
/ and g, and 



dH{f,g)=y{^/f-v9y] 



1/2 



is the Hellinger distance between / and g, which is bounded by 2. 

Theorem 2. Suppose that the data Xi, . . .Xn are assumed to arise from some true 
density fo{x), and that models Mi and M2 are to be compared, where Mj = {fj{x \ 
9j), TTjiOj), 6j G ¥j}. First, suppose that the true fo{x) is in the KuUback-Leibler 
support of 712 ■ That is, for all e > 0, 

vr2[^2: t/^{/o(-),/2(-|^2)}<e]>0. (7) 

Second, suppose that for all sufficiently large n and for any c > 0, the following 
bound holds almost surely under fo{x): 



sup 



9ieFi ^^, 



n^%^<^-- («) 



fo{x^ 



Finally, suppose that 



inf rfH{/i(-|^i),/o(-)}>0. (9) 



Then B12 — ?■ almost surely under /q. 

Proof. To prove the result, consider the denominator of B12 in Equation ([6]) with a 
factor introduced, which is also introduced to the numerator, and so the factor cancels 






out: 

a/2(a;i I O2) 

It is well known that with the first condition of the theorem, I„,2 > e~"^ almost surely 
for all large n and for any r > 0. See, for example, Schwartz (1965). 
Now consider the numerator with the same factor introduced: 






We can write this with an upper bound as follows: 

The second condition will ensure that the first term remains bounded. The second 
term, labelled as J„i, has expectation 



E{Jnl) < 



i-¥H{fi-\Oi)Joi-)} 



TXi{ddi 



Hence, with the third condition, for some ?7 > 0, E{Jni) < e~"^ and thus J„i < e"""*^ 
almost surely for all large n and for some 5 > 0. 
Putting these together, we have 

B12 < exp |— n((5 — \c — t)^ almost surely for all large n for any c, r > 0. 

Choose |c + r < 5 to obtain the desired result. D 

Using a similar argument, it is also possible to show that B12 — ?■ 00 almost surely 
if the following three conditions hold. First, the true fo{x) is in the KuUback-Leibler 
support of TTi; that is, for all e > 0, 

TTii^i: rfA'{/o(-),/i(-|^i)}<e]>0. 

Second, 

" f ix- \9 ) 
sup TT — — -^ — r — < e"'^ almost surely for all large n for any c > 0. 

e2&2~J{ JoyXi) 



Finally, 



inf rfH{/2(-|^2),/o(-)}>0. 



4.2 Regression models: monotone vs. non-monotone 

We will now adapt this result to examine the regression spline model introduced 
previously. Suppose that Mi is a normal model with monotone mean function, 



Ml = {gi{y I x,0i) = N{y \ /i(x),af), 7ri(/i, a^)}, 

and M2 is a normal model with non-monotone mean function, 

M2 = te(y I x,92) = N{y I Ux),al), 7r2(/2, a^)}. 

Assume the Xi are sampled from distribution Q with support on (0, 1). Let / be a 
regression spline function, which can be monotone or non-monotone, and define the 
sieve 

S'^ = < f E C : f has knot points at the (xj) and / |/'(x)|^ dx < An > 

for some A„ ^ +00, where C is the space of continuous, piecewise linear functions on 
(0,1). 

Finally, define 



B 



12 



!Y{tiN{y.\Ux.:),al}7:,{de2) 



where 6j = {fj, aj). 

Our main result is stated below. 

Theorem 3. Assume that fo, the true regression function is bounded and continuous 
on (0, 1) and, if monotone, has a derivative bounded away from 0. Suppose that 
An = 0{n^/'^~^) for some 5 > 0; that the L2 support of the prior for f, which may 
depend on the sample size, coincides with S'^; and that the support of the variances 
coincides with ai,a2 < <J+ < C)0. Then, if the true model lies in Mi, B12 diverges 
almost surely, whereas if the true model lies in M2, then B12 — )■ almost surely. 

The full proof is somewhat technical and is provided in the Supplementary mate- 
rial. The essential ideas are conveyed by the proof of Theorem 1, while many of the 



technical details draw on the method of sieves studied by Geman and Hwang (1982). 
We now briefly describe the connection between the theorem and our regression- 
spline approach, and the way the conditions of the theorem correspond to the three 
conditions of Theorem 1. 

The prior for /, say 'k^'^\ will depend on the sampled Xi and actually have as 
support the functions in S'^. In order to ensure the true model is in the Kullback- 
Leibler support of the prior for all large n, we assume the true variance is bounded 
by cr+, and that for all large n, /o is in the L2 support of the prior. That is, for all 



e > 0, and sufficiently large n, 

vr^") |/ : Jifoix) - f{x)fQ{dx) < e| > . 
In the proof of Theorem 2 we need the condition that for some t > 0, 

e'\y\FY{y)dy<oo, 



where Fy is the true marginal distribution of the y. This condition on Fy and A„ 
is to ensure that the sieve maximum likelihood estimator of /o exists and converges 
appropriately for our purposes. Here 



FY{dy) = J N{dy\fo{x),al)Q{dx), 

and hence the condition on Fy will be satisfied if 

/ exp {/o (a;)} Q{dx) < oo . 

This holds when /o is bounded, which we assume in Theorem 2. 

When /o is monotone, we need to assume that /g is bounded away from to 
ensure that the probability a non-monotone function can get arbitrarily close to it is 
sufficiently small. We can not ensure this sufficient smallness in probability if /o is 
flat. The idea is that if the non-monotone function has probability 1/2 of increasing 
or decreasing at every Xi point, then the probability it increases n — 1 times but 
decreases once is exponentially small. On the other hand, it is not possible for a 
monotone function to get arbitrarily close to a non-monotone function. 



5 Experiments 

5.1 Description of simulations 

This section reports the results of an extensive simulation study benchmarking the 
proposed methods against three alternatives that previously appeared in the litera- 



ture. The first two are well-known in the classical literature: the U test from Zheng 



(1996), and the bootstrap-based test from Bowman et al. (1998). Our third bench- 
mark was the method of Bayesian Bonferroni correction proposed by Dunson (2005) 
where each increment of / is as 

{6i\w,T^) ~ wN+{6i\0,T^) + {l-w)N{6,\0,T^), 
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Figure 1: The Bowman et. al. test function with a = 0.78. 



where w is the mixing probabihty and N"*" indicates a normal distribution truncated 
below at zero. By analogy with Bonferroni correction, one then chooses u; as a 
function of the sample size such that the event {6i > for all i} has prior probability 
1/2. This model can be fit straightforwardly via Gibbs sampling. 

Our simulations used the model i/i = f{xi) + e^, with n = 400 equally spaced x 
values on (0,1], and the e^ independent and identically distributed A^(0, 1) random 
variables. We generated 100 data sets for each of the 13 functions listed in the left- 
most column of Table [Tj Five of these functions are monotone, but either fiat or very 
nearly fiat compared to the chosen error variance of a^ = 1. The remaining 8 functions 
are non-monotone, but only just. The goal of the experiment was to assess the 
performance of all methods across a range of situations with low signal-to-noise ratios. 
The first 11 test functions involving simple linear, quadratic, and exponential forms. 
The last two test functions are of the form f{x) = l + x — aexp{—0.5(x — 0.5)^/0.1^}, 
which is used by Bowman et al. (1998) in their simulation study. Setting a = 0.78 



gives a decrease of 0.5 from the maximum of the function on the interval (0, 0.5) to 
the minimum of the function, while setting a = 1.05 gives a corresponding decrease 
of 0.75. See Figure [l] for the case where a = 0.78, along with one simulated data set. 
In Table [T] we refer to these functions as BJG, for Bowman, Jones, and Gijbels. 

For the two regression spline-based tests developed in Section 3, we used m = 9 
equally spaced knots, and set c = 5, pj = pr{Lj = 0) = 0.75, and qi = pr(/^(x) > 
I t) = 0.5. The simulation results indicate that these values give procedures with 
good small-sample classification properties for a wide range of functions. The MCMC 
sampling scheme for each procedure was run for a burn-in period of 20,000 iterations 
and a sampling period of 100,000 iterations. 

For the constrained smoothing-spline approach, we used a particle filter with 
50,000 particles. We assumed a Gamma(2,2) prior on r^, the variance of the un- 
derlying Wiener process for f'{x). For the crossing time ^, we assumed a standard 
log-normal prior, which has 50% of its prior probability on the unit interval. The 
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Figure 2: Receiver operating characteristic curves for the different methods, evalu- 
ated across all 1300 data sets simulated under the functions listed in Table [H The 
diagonal line is the line of unit slope, while the vertical dashed line is at a =0-05. 

plug-in estimator described in Section 2 was used for the error variance a^. 

5.2 Results 

The receiver-operating characteristic curve in Figure |2] provides a common scale to 
judge the overall performance of the various methods across the range of signal classes 
considered here. To calculate this curve, observe that all tests can be phrased in terms 
of a test statistic t and a critical value t*, where the null hypothesis of monotonicity 
is rejected if t < t*. For all Bayesian methods, we used the Bayes factor against 
monotonicity as the test statistic. For the bootstrap and U-tests, we used the p-value 
under the hypothesis of monotonicity as the test statistic. At a given critical value 
t*, the realized false positive rate and false negative rate may be calculated across all 
1300 tests. Thus we are calculating an average performance across the 13 different 
functions, some of which are monotone and some of which are not. The curves are 
then traced out as one varies the critical value independently for each method. 

As Figure |2] shows, the Bayesian methods proposed here exhibit the best overall 
frequentist properties for functions among those considered. At an a level of 0-05, 
the smoothing-spline approach has similar power to the U test, and noticeably better 
power at all higher a levels. The regression spline method had uniformly higher 
power than the other methods across all a levels, regardless of whether the Gaussian- 
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Figure 3: The empirical sampling distribution of p-values under the null hypothesis 
for the bootstrap test, left panel, and U test, middle panel. In neither case is the 
distribution uniform. Right: actual versus nominal size 

or moment-based prior is used. The Bayesian-Bonferroni method and the U tests 
behaved similarly overall, with the U- test giving slightly better power for most a 
levels. The bootstrap-based test performed the poorest, having the least overall area 
under the receiver-operating characteristic curve. 

Figure [3] shows the empirical distribution under the null hypothesis of the p- 
values calculated using the bootstrap test and U test. There were 500 such cases 
in our simulation where the null hypothesis was true. These 500 p-values were not 
uniformly distributed on (0, 1) under either method, although their distribution under 
the U test was much closer to uniform. These departures from uniformity account 
for the phenomenon observed in the right panel of Figure 3, which shows the realized 
false-positive rate for the bootstrap and U tests as a function of the nominal size of 
the rejection region. In practical terms, this plot suggests that, when using the U 
test, one may need to choose a rejection region of nominal size a > 0-05 in order to 
produce an actual false-positive rate of 0-05 under the null. 

Table [l] shows more detail about the results of the simulation study for each 
function. We calibrated the rejection region for each method to have a realized a 
level of 0.05 across all functions, indicated by the vertical dashed line in Figure [2] 
This calibration is necessary to make a fair comparison across the different functions, 
and as shown in Figure |3| is not the same as the nominal a level for the bootstrap 
and U tests. We then tallied the number of correct classifications for each function at 
this calibrated critical value. A few interesting differences emerge. For example, the 
U test has superior power for the Bowman et. al. functions compared to the Bayesian 
methods, while the smoothing-spline approach does worse for the functions with dips 
near x = 1. 

One further aspect of the Bayesian method is that it can be used to simultane- 
ously test global and local hypotheses about the behavior of the function, without 
invoking any concerns about multiplicity. This becomes clear when examining a spe- 
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Table 1: The entries in the table show the percentage of times in 100 Monte Carlo sim- 
ulations that the given function was classified correctly, with each method's rejection 
region calibrated to have a size of a = 0.05 across all 13 functions. RS: constrained re- 
gression splines (Gaussian and MoM priors). SS: constrained smoothing spline with 
Wiener-process prior. BB: Bayesian Bonferroni. Boot: bootstrap-based test from 



Bowman et al. (1998). U: nonparametric U-test from Zheng (1996). 













Method 






Function 


Monotone? 


U 


Boot 


SS 


BB 


RS (Gauss) 


RS (MoM) 


-0.292 exp(x3) 


No 


15 





91 


19 




92 


85 


-0.146 exp(x3) 


No 








48 







44 


33 


0.146 exp(x3) 


Yes 


98 


98 


97 


98 




97 


97 


0.292 exp(x3) 


Yes 


91 


87 


98 


93 




100 


100 


Flat 


Yes 


100 


99 


85 


99 




85 


90 


+ 4.0(x- 0.75)2 


No 


94 


81 


15 


96 




54 


58 


+ 8.0(x- 0.75)2 


No 


97 


92 


36 


99 




70 


74 


-0.50x 


No 


27 





91 


28 




92 


86 


-0.25x 


No 


2 





58 


4 




46 


34 


0.25a; 


Yes 


99 


96 


99 


95 




97 


95 


0.50a; 


Yes 


88 


94 


100 


91 




97 


94 


BJG, a = 1.05 


No 


99 


92 


60 


71 




87 


89 


BJG, a = 0.78 


No 


98 


65 


20 


29 




31 


35 



cific case. In Figure |4] we see two examples of data sets in our simulation study: 
one where f{t) = 1 + 8.0 (t — 0.75) 2, which is non-monotone; and the other where 
f(t) = 0.5t, which is monotone. For each of these two data sets, the U-test rejected 
the null hypothesis of monotonicity at the a = 0.05 level. The Bayesian test using 
constrained Brownian motion also favors the hypothesis of non-monotonicity for each 
case: posterior probability 0-98 for the function in the top pane, and 0-74 for the one 
in bottom pane. 

The problem comes when we try to identify the range of values where / ex- 
hibits likely non-monotonic behavior. The U-test, intuitively, asks whether an un- 
constrained estimate of f{t) significantly improves the residual sum of squares, versus 
a constrained estimate. In this sense it behaves like an F-test in a classical analysis 
of variance problem. It is an omnibus test, and leads to a large multiple-comparison 
problem if one wishes to test for specific local features of the data set that yielded 
a rejection from the omnibus test. In contrast, the posterior probability associated 
with any such local question, such as whether the function decreases between 0.6 and 
0.8, arises naturally from the joint posterior distribution under the Bayesian model. 
One may ask an unlimited number of such questions of the posterior distribution, 
without posing any multiplicity concerns. 
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Figure 4: The left two panels show results for a data set where E{y) = 1 + 8.0(t — 
0.75)^. The first panel shows the data as grey dots, the filtered mean and 95% filtered 
confidence interval for / as dotted lines, and the smoothed mean as a solid line. The 
second panel shows the histogram of draws from the posterior distribution of the first 
crossing time ^. The right two panels show the same two figures for a data set where 
E{y) = 0.5t. 

A Further detail on constrained smoothing splines 

Proposition 1 in the main manuscript characterizes the distribution of the increments 
of the stochastic process g{s) for a particular value of the crossing time ^. It naturally 
suggests a sequential Monte Carlo algorithm for model-fitting. Let Aj = tj+i — tj 
denote the jth time increment, and approximate the value of / at time tj as f(ti) ~ 
'Ylij<,i9{^j)^ji assuming that the time bins are sufficiently small such that g{tj)Aj is 
a good approximation of J^^'*'^ g{s)ds. 
Subject to this approximation. 



{9^\r,tgi-l) ~ <; N((7. |0,r2(t,-O) 

N(^. I gi-i,T^itj - tj_i)) if tj^i > ^ , 



if tj<t 



if tj-i <^ <tj 



(10) 
(11) 

(12) 



recalling that FN denotes the fractional normal distribution from Proposition 1. The 
model is characterized by the unobserved state vector g = {gi, . . . , (^at)^, and the three 
unknown parameters ^, r, and a"^. To fit it, we use the particle-filtering algorithm from 
Liu and West (2001), which is a modification of sequential importance resampling. 



The idea is to introduce a particle approximation to the time-t filtered distribution 
of the state gt and unknown parameters r and C,- The problem of particle decay is 
handled by propagating each particle forward with a small jitter added to ^ and r, 
drawn from Gaussian kernels centered at the current values. This would ordinarily 



result in a density estimate that is over-dispersed compared to the truth. But Liu and 



West (2001) suggest a shrinkage correction that shifts the propagated values toward 
their overall sample mean, resulting in a particle approximation with the correct 
degree of dispersion. 

With a moderate amount of data, a^ can be estimated quite precisely. We use 
a simple plug-in estimate, derived from an unconstrained local-linear-regression esti- 
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mate of /, with the kernel bandwidth chosen by leave-one-out cross-validation. This 
produces an estimate for a^ that is asymptotically efficient, regardless of whether the 
underlying function is monotone. We also experimented with the particle-learning 



approach of Carvalho et al. (2010), by tracking sufficient statistics for a^ as part 
of the state vector. Only very small differences between these two approaches were 
observed, seemingly validating the simpler plug-in method. Code implementing the 
method is available from the authors. 



B Further details on constrained regression splines 

B.l The model 

This section develops two tests of monoticity using a regression spline model with 
different prior distributions for the regression coefficients. Equation (5) from the main 
manuscript specifies a finitely parametrized approximation to the function /(x), and 
introduces our notation for the problem. For the sake of clarity, we re-iterate this 
notation here. We represent the unknown function as 

f.m{x) =a + PiX + (32X^ + (33{X - Xi)l H h (3m+2{x - Xrn)+ , 

where the Xj are the ordered knot locations, and z+ indicates the positive part of z. 
Using all m knots, this model may be re- written in matrix notation as y = al+XfS+e, 
where y is the n x 1 vector of observations, 1 is a vector of ones, X is an n x (m + 2) 
design matrix, and /3 is the vector of spline coefficients. Let i be a vector of indicator 
variables whose jth element takes the value if (3j = 0, and 1 otherwise. Let /3t consist 
of the elements of /3 corresponding to those elements of l that are equal to one, and 



let p = \l\ denote the number of nonzero entries in t. Shively et al. (2009) derive the 
constraints on /^^ that ensure the monotonicity of fm{x) for a given l. Specifically, 
fm{x) is monotone whenever L^P^ = 7t > 0, where L^ is a known lower-triangular 
matrix that depends on l and the x/s. 

(1) (2^) 

Given t, the /3t space is divided into 2^ disjoint regions denoted Rl ,. . . ,Rl , 
with each region defined by a different combination of signs of the derivative f^{x) 
at each of the included knots. In general, if the sign of f^i^) ^^ any included knot is 
negative, then the function is not monotone. Without loss of generality we may let 
R[ denote the region where the derivative is non-negative at each of the included 
knots, in which case /^(x) > for all x G [0, 1]. For a specific l and prior p(/3J, one 
may compute the prior probability pr(/3t E Rl ' \ l), which is identical to pr{/^(a;) > 
for all X G [0, 1] | l}. 

Given l and a"^, the prior for /3 is a mixture of 2^ distributions, with the dth mixture 
component 11^ constrained to have support on Ri . Two specific choices for 11^ are 
considered: one based on the multivariate Gaussian distribution, and the other based 



on the multivariate method-of-moments distribution described by Johnson and Rossell 



(2010). As discussed below and shown in the simulation experiment in Section 5 of 



the main manuscript, the two resulting tests have different small sample properties. 
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We first note that a multivariate Gaussian distribution for the regression co- 
efficients is inappropriate for testing purposes, because for any given region Rl , 
pr(/3i E Ri \ l) will be very small if there are a large number of knots in the model. 
Because R[ is identified with the region where f{x) is monotone, this means there 
is a vanishingly small prior probability on the null hypothesis as the number of knots 
increases. 

We therefore take the following approach. For a given l not identically 0, let L^ 
be the lower-triangular matrix that defines the monotonicity constraint on P^. Let c 
be a fixed scale factor. The Gaussian-based prior we use is 

(/3, I i,a') ~ f^g, Tn{a | 0, ca2(L,Lf)-\ i?^} , 

where TN(x | m, V, R) denotes the density function, evaluated at x, of the multivariate 
normal distribution with mean vector m, covariance matrix V, and truncation region 
R. The integral of the density function over each of the 2^ mutually disjoint regions 
may be done analytically, due to the choice of covariance matrix. In particular, 
pr(/3t E R[ \ i) = Qd, so conditional on l, gi is the prior probability of a monotone 
function. A reasonable choice here is gi = 1/2, with all the remaining g^'s being 
equal. A second alternative is to choose the g^ values so the prior probability that 
the derivative function crosses zero between any pair of knots is the same. 

As a second choice of prior, we also consider the multivariate moment-based prior, 

{^\i,a') ~ aX^g./3fS:VjS,|-i/2exp|-i/3j^S:V.|, 
d=i ^ ^ 

where C^ is a known normalizing constant not depending on /3, and where S^ = 
ca'^{LiLj) as before. The integral of this density function over each of the sub- 
regions can be done analytically to give pr(/3t G Rl \ l). As in the parametric 



linear case considered by Johnson and Rossell (2010), using the moment-based prior 



gives a test for monotonicity with different properties than the test obtained using a 
Gaussian prior. These differences are illuminated in the simulation results reported 
in Section 5 of the main manuscript. 

To complete the model for fm{x) we must also specify priors for a, a'^ and l. 
The intercept a is given a vague mean-zero normal prior with variance 10^°, and the 
variance is given a fiat prior on [0,10^]. Using vague priors on these parameters is 
acceptable, as they appear in all models under consideration, and there is no inde- 
terminacy in the Bayes factor as a result. The prior for the knot-inclusion indicator 
L was discussed in the main manuscript; each element is given a Bernoulli prior with 
fixed probability pj. 

The prior probability of monotonicity may then be computed as 

Wifmi.^) > for all a; G [0, 1]} = q, [1 - pr{. = (0, . . . , 0)}] + pr{. = (0, . . . , 0)} , 

since gi = pr{/^(x) > for all x G [0, 1] | t}, and since a fiat function with l = 
(0, . . . , 0) is also a monotone function. 
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B.2 Sampling scheme for posterior inference 

To construct our Markov-chain Monte Carlo sampling scheme, we use an alternative 



parametrization of a regression spline model proposed in Shively et al. (2009). For a 
given i, define W^ = X^L~^ and 7^ = L^f3^, where L^ is the constraint matrix defined 
previously, and where X^ consists of the columns of X corresponding to the nonzero 
entries in l. This allows us to rewrite the function as fm{x) = al + W^t7i, and to 
identify the regions Rl as orthants in the transformed 7 space. In particular, the 
region of monotonocity is the first orthant, where 7^ > 0. This greatly simplifies the 
sampling scheme. 

Under this new parametrization, the Gaussian-based prior becomes 

(7. I i,a') ~ 5^g, TN{7. | 0,ca'l, Ri'^} , 

d=l 

while the moment-based prior becomes 

(7, I i, a^) ~ C, ^ gd 7^7. exp <^ -^"^7^7. [ • 
d=i I ccr J 

Sampling a and a^ is straightforward, so the details of these steps are omitted. 
We now discuss the details for sampling t and 7^ under the multivariate method-of- 
moments prior, the details for the Gaussian-based prior being similar. 

Let ij denote the jth element of l, and t_j denote the remaining j — 1 elements. 
Similarly, let 7^ denote the jth element of 7, and 7_j the remaining elements. We 
sample each (ij,7j) jointly, given (t_j,7_j), the data, and all other model parameters. 
To keep notation simple, we let denote the complete set of other model parameters, 
including the entries in l and 7 not being sampled. 

We generate (ij, 7j | y, 6) by first generating {lj \ y, 6) marginalizing over 7^, and 
then generating (7^ | Lj,y,Q). 

Without loss of generality assume that we are updating li. To compute p{li = | 
y,Q), observe that if at least one element of i_i is nonzero, then 

p(m = 0\y,Q)= Cp{y Ui = 0, 9) p(7-i Ui = 0, l^i) p{l^ = 0) , (13) 

where C is a constant. The second term on the right-hand side is 

p(7-i I ii = 0,._i) = g2^(cs)-S^i7-i(27rc)-^/2gxp |-l7^,7_i 

where s = \i-i\ is the number of nonzero elements in i_i; and where g = gi if all 
elements of 7_i are positive, and (1 — gi)/(2* — 1) otherwise. If all elements of i_i 
are zero, then the same representation holds with p(7_i | ^i = 0, i_i) set to one. 
To compute p{li = 1 \ y,Q), note that 



p{li = 1 I y, 6) = C <j I p{y\Li = l, 71, 9) p(7i, 7-1 Ui = 1, i-i) d-ii \ p(ti = 1) , 

(14) 



where C is the same constant appearing in Equation (13). Let 5i = {y — al — 
iy„i7_i) — Wi7i, with wi representing the first column of W^i,t_i, and W-i the re- 



maining columns. The first term in the integrand in (|14) may be written as 

1 



p{y\ ii = l,e) = (27ra2)-"/2exp 
Also, p(7i,7_i \ Li = 1, 6-i) is given by 



2a 



1 



,6J6 



a(7i) r{rcr\j! + 7^17-1) (Svrc)-'-/^ exp |-^(7? + 7-i7-i)| , (15) 

where r = s + 1, a(7i) = g if 71 > 0, and (1 — gi)/(2^' — 1) if 71 < 0. Recall that 
g = gi if all elements of 7_i are positive, and (1 — qi)/{2^' — 1) otherwise. 
Let y = y — al — Vr_i7_i, and let 



d = r{cr)-'c-'/\2nc)-'/\2nay/^ 



exp 



2c 



T 

7-17-1 






-1/2 



P(^i 



Then ( 14 ) can be written as 
p{li = 1 I y, 0) = 



X 



Cdexp 

1-gi 
2^- 1 



T~\2 



1 fry- (w* y) 



20"^ [ wf Wl + cr^/c 
^00 

Hli) d'ji + q /i(7i) (i7i 
00 Jo 



(16) 



where 



Hli) = (7i' + 7^i7-i)(27rT2)-i/2exp 



2r2 

7 



(71 - f^-,f 



7i 

/^7 



(w^wi) w^y 



-l„..T~. 



{wi wi + a /c) w^ y 



The two integrals with respect to 71 may be done analytically. 
If Li = 0, then 7i need not be generated. If li = 1, then 

pill \y,^i = 1, 0) oc p{y I 61 = 1, 71, 6) p(7i, 7_i | l) 



with p(7i,7_i I l) already given in (15). This is a mixture distribution that can be 
sampled by an efficient accept/reject algorithm. 

The details for the Gaussian prior for very similar, with the modification that the 
term 'yf + 'yTil-i does not appear in front of the normal kernel, and that drawing from 
the mixture distribution above may be done using constrained Gaussian distributions, 
without appealing to an accept /reject algorithm. 
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C Proof of Theorem 2 

We start with a Lemma which provides a useful result for both /o G Mi and /o G M2. 

Lemma 1. If /„(x) is the maximum likelihood estimator of the regression spline, 
including both monotone and non-monotone parts, and 

n 

'$^(/(^.)-/o(a;i))'^0 almost surely, (17) 



n 



then 



and 



n 

liminf inf n^^ y (yi — fi{x,i)Y > ctq almost surely, 

" /i ^ 

1=1 



n 

liminf inf n^^ \ [yi — f2{x.i)Y > ctq almost surely. 
1=1 



Proof. If (17) holds then we have, from the triangular inequality, and removing 



almost surely from what follows, that 

n n n 

n-' Y.^y. - h{x.)f < n-' J2{y^ - fix.))' + n'' J](/(x.) - Mx,))' 

i=l 1=1 4=1 

The last term goes to so in the limit 

n n 



i=l 2=1 

But from the definition of the maximum likelihood estimator, 



-' Y.^y^ - f{x{)f < n-' Y.{y, - Ux,)f 
and hence, as 



n 

j=l i=l 



n 

-1 ^ 

n 

i=l 

it follows that 



^ Y^iVi - foixi))' -> crl 



n 
— 1 ^ 

n 

i=l 



i=l 

Then, if /o is monotone, we have by definition 

n n 
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and, if /o is non-inontone, we have also by definition 

n n 

completing the proof. 



n 

i=l i=X 



Conditions under which (17) holds are to be found, for example, in Geman and Hwang 



(1982), using sieves. So define 



Sn = i f{x) : /(x) continuous, /'(x) piecewise continuous, / \f'{x)\'^dx < A^ 

for some increasing A„ f oo. The functions used in the present paper use the smaller 
sieve, S"^, detailed in the paper, but this does not make a difference since the sieve 
maximum likelihood estimator from Sn actually comes from S"^. 

On pages 10 and 11 and in Theorem 2 of a Brown University technical report in 
1981 by S. Geman on sieves for nonparametric estimation of densities and regressions, 
it is shown that fn{x), the minimizer of 

n 

subject to / G Sn, is piecewise linear, with knots at the (xj), and if 

exp{t\y\) Fridy) < oo 



for some t > 0, which we assume to be the case, then A„ = 0{n^'^ ^) for some 5 > 0, 
is sufficient for 

/ ifnix) — fo{x)] Q{dx) —7- almost surely 

It is then easy to show, assuming /q is bounded, that 

/ if nix) — foix)] Qn{dx) — )■ almost surely 



where Q„ is the empirical distribution of the (xj). Hence, (17) follows. 



1. /o is non-monotone. Let us first assume that foix), the true mean function 
on X, is non-monotone. That is, the true density is go{y\x,9Q) = A^(j/|/o(x), Uq). So 

consider 

lUliN{m\Mx,),a!)n,ide,) 



B 



12 



!Y{tiN{y,\h{x,),al)7:,{de,y 
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The denominator, with additional factor involving the true model, is given by 

J triN{yi\fo{xi),a'o) 
Hence, provided the prior for (/2,o"|) has /o in the Kullback-Leibler support, that is 

7i2\{f2,(T2) : f dK{N{-\f2{x),al),N{-\fo{x),(jl))Q{dx) <e] >0 



for all e > 0, then /„2 > e "'^ almost surely for all large n, for any r > 0. 
The numerator can be written as 

and /i belongs to the set of monotone functions. We need to show that /„i < e^"'' 
almost surely for all large n for some S > 0, and we can do this with the following 
two conditions: the first of which is from Lemma 1, 

1. It is that 

n 

liminf inf n^^ \ (yi — fi{xi)Y > CTq almost surely 
" /i ^ 

2. If the X are sampled from Q then, for some constant ip > 0, 

sup / exp {-ilj{fo{x) - fi{x))^} Q{dx) < 1. 
/i J 



First, we have 

Write these two terms as Kni and J„i, respectively. 
If we let fi{xi) be the minimizer of 

n 
i=l 

then the appropriate af is given by 

n 

n-'Y.{y.-h{x,)f. 



22 



Hence, 

n/2 



Kni = ° — ^ exp <^ -in + \nn ^ ^iVi - foi^'dY/^o 



i 

Since 

n 

n~^ ^{Vi - fo{xi)f/al -> 1 almost surely, 

i=l 

and using condition 1., it follows that Kni < e""^ almost surely for all large n, for any 
r] > 0. 

On the other hand, the expectation of J„i is given by 

EJ^^ = j ll-\ j dUN{-mx),al),N{-\f,{x),al))Q{dx)\ m{d9,). 

Now 

>l_exp{-^(/o(x)-/i(x))2} 

for some constant ip > 0, ii we impose an upper bound on the prior for af. 

Therefore, with condition 2., E J„i < e""'', for some /t > 0, and hence J„i < e""*^ 
almost surely for all large n, for some > 0. 

Hence, J„i < e"'^e~"''^ almost surely for all large n for any r] > and for some 
> 0, yielding /„i < e""*^ almost surely for all large n for some 6 > 0. Putting all 
this together 

Bi2 < exp{n(-5 + r)} 

almost surely for all large n for any r > and for some 6 > 0. So choose r < 5 to 
get the required result that B^ — )■ almost surely as n — > oo. 
Condition 2., i.e. 

sup / exp {-ip{fo{x) - fi{x)y} Q{dx) < 1 
/i J 

must hold since fo{x) is a fixed non-monotone function and it is not possible for a 
monotone function to get arbitrarily close to it. 

2. /o monotone. Now let us consider the reverse case when /o is a fixed monotone 
function. Following the same style of proof we have a number of considerations; 
amounting to the Kullback-Leibler condition, which we take as given, and. 



3. It is that 



n 



liminf inf n ^\ {Vi — f2{xi)Y > (Tq almost surely 



i=l 
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4. For some constant ip > and po < l? it is that 

Jexp{-^{f{x) - Ux)f}Q{dx)] n^m < p^ (18) 

for all large n. 

Condition 3. follows from Lemma 1. For condition 4., we can split the outer integral 
into two parts: one with 



exp{-^(/(a;) - foix)y}Q{dx) < 1 - e 
and the other with 

exp{~ilj{f{x) - foix)^} Q{dx) > 1 - e 



for a sufficiently small e. The case for the inner integral being bounded above by 1 — e 
is clear, so we need to show that 



7r2 



/2 : / exp {-^Uo{x) - f2{x)f} Q{dx) > 1 - e < p", (19) 



for some p < 1. 

Suppose we have Q{dx) which takes n equi-spaced samples (xi, . . . ,x„) and, for 
each /2, define k as the number of points for which \f2{xi) — fo{xi)\ < 6. The condition 
now translates to 

Tr2ik/n> 1-6*) < p" 

for some 6* > 0. To see this, and letting Zi = ■?/'(/o(xj) — f2{xi))'^, we need to consider 



n ^ ^ e ^' > 1 - e. 
Putting 0j = 1 — e^^% we need to consider 

n 



i=l 



and Zi < 6 implies (pi < 6. Now putting 6 = Me, for some M > 1, then we need at 
least k of the (pi to be less than 6 where 

n~^{n - k)Me < e. 

Thus, k>n{l- 1/M). 

Now P{k > n(l — 6*)) is bounded above by P{ku > n{l — S*)), where ku is the 
number of increasing terms for the non-monotone function. This follows since if the 
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derivative of fo{x) is bounded away from 0, then /(xj) can only be close to fo{xi) if 
the function increases in the interval immediately preceding Xj. 

Hence, for some pi, fc„ is binomial (pi,n), and using a normal approximation, 

ku/n^ N{pi,pi{l - pi)/n) 

and hence 

P{k > n{l - 5*)) < P{ku > n{l - 6*)) ^ P {z > i^) 

for some ^ > 0, and where z is a standard normal r.v. Using the asymptotic expression 
for the survival function of the normal cdf, we have 

P{ku > n{l - S*)) ^ -^ exp(-C2n) 



for positive constants ci and C2. Hence, (19) holds true, and so (18) holds true 
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